{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 我们先来自己看一下这些模块的功能，BERTopic也只是把这些功能拼装起来了而已"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 20个句子，一部分讨论天气，一部分讨论自然语言处理\n",
    "sentences = [\n",
    "    \"The weather today is sunny with clear skies.\",\n",
    "    \"Tomorrow's forecast predicts heavy rain and thunderstorms.\",\n",
    "    \"I love watching the sunset during mild weather evenings.\",\n",
    "    \"Natural language processing is a field of study in computer science.\",\n",
    "    \"NLP techniques are widely used in text analysis and language understanding.\",\n",
    "    \"Weather forecasting relies on complex algorithms and data analysis.\",\n",
    "    \"Understanding weather patterns is crucial for agriculture and disaster management.\",\n",
    "    \"NLP algorithms can help in sentiment analysis of social media posts.\",\n",
    "    \"The weather can be unpredictable, especially during transitional seasons.\",\n",
    "    \"NLP models like BERT and GPT-3 have revolutionized language understanding tasks.\",\n",
    "    \"Extreme weather events like hurricanes and tornadoes require advanced prediction models.\",\n",
    "    \"NLP is used in virtual assistants like Siri and Alexa to understand human commands.\",\n",
    "    \"Climate change is affecting global weather patterns.\",\n",
    "    \"NLP can aid in machine translation, making communication across languages easier.\",\n",
    "    \"Weather satellites provide real-time data for meteorologists to analyze.\",\n",
    "    \"Semantic analysis is an important aspect of natural language processing.\",\n",
    "    \"Weather phenomena such as El Niño impact weather worldwide.\",\n",
    "    \"Part of speech tagging is a fundamental task in NLP.\",\n",
    "    \"The weather in coastal areas is often influenced by ocean currents.\",\n",
    "    \"NLP helps in chatbots to generate human-like responses.\"\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Embedding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "SentenceTransformer(\n",
       "  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel \n",
       "  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})\n",
       "  (2): Normalize()\n",
       ")"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sentence_transformers import SentenceTransformer\n",
    "embedding_model = SentenceTransformer(\"all-MiniLM-L6-v2\")\n",
    "embedding_model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(20, 384) [[-0.00087368  0.07574054  0.11562102 ... -0.02699839 -0.13160652\n",
      "   0.07899172]\n",
      " [-0.0645291  -0.0372324   0.11404095 ... -0.03348763 -0.07219101\n",
      "   0.05458359]\n",
      " [ 0.05412416  0.00382094  0.12769882 ...  0.03218497 -0.08884915\n",
      "   0.02998092]\n",
      " ...\n",
      " [ 0.01936751  0.01480619  0.02865424 ...  0.11244407  0.02719994\n",
      "   0.03334687]\n",
      " [ 0.00312427 -0.00348332  0.14544894 ...  0.01540684 -0.01159542\n",
      "   0.10587737]\n",
      " [-0.06357921 -0.03346658  0.08694217 ...  0.15366903  0.04320052\n",
      "  -0.01385405]]\n"
     ]
    }
   ],
   "source": [
    "# 展示一下Embedding的能力\n",
    "embeddings = embedding_model.encode(sentences)\n",
    "print(embeddings.shape, embeddings)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. 降维"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在创建了文档的数字表示之后，我们必须降低这些表示的维数。由于维数灾难的存在，集群模型通常难以处理高维数据。有很多方法可以降低维数，如 PCA，但是在 BERTopic 中选择默认的 UMAP。这是一种技术，可以保持一些数据集的局部和全局结构时，降低其维数。\n",
    "https://maartengr.github.io/BERTopic/algorithm/algorithm.html#1-embed-documents"
   ]
  },
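  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the curse of dimensionality mentioned above a bit more concrete, here is a minimal sketch. It assumes uniformly random points rather than the real sentence embeddings, and `relative_contrast` is a hypothetical helper introduced only for this illustration. It measures how much farther the farthest point is than the nearest one from a query point; in high dimensions this contrast tends to shrink, which is one reason distance-based clustering struggles on the raw 384-dimensional vectors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "import random\n",
    "\n",
    "def relative_contrast(dim, n_points=200, seed=0):\n",
    "    # (farthest - nearest) / nearest distance from one query point to n_points random points\n",
    "    rng = random.Random(seed)\n",
    "    query = [rng.random() for _ in range(dim)]\n",
    "    dists = [\n",
    "        math.dist(query, [rng.random() for _ in range(dim)])\n",
    "        for _ in range(n_points)\n",
    "    ]\n",
    "    return (max(dists) - min(dists)) / min(dists)\n",
    "\n",
    "# 2 is the target dimension of the reduction, 384 is the embedding size from step 1\n",
    "for dim in (2, 384):\n",
    "    print(dim, round(relative_contrast(dim), 3))"
   ]
  },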
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "from umap import UMAP\n",
    "umap_model = UMAP(n_components=2, metric='cosine')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "UMAP官方文档 https://umap-learn.readthedocs.io/en/latest/basic_usage.html\n",
    "\n",
    "视频教程：https://www.bilibili.com/video/BV1qB4y1p7CF/?spm_id_from=333.337.search-card.all.click&vd_source=eace37b0970f8d3d597d32f39dec89d8"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(20, 2) [[12.812426  -7.6344767]\n",
      " [12.373646  -8.142993 ]\n",
      " [13.022554  -7.2421017]\n",
      " [ 5.875539  -5.986935 ]\n",
      " [ 6.0638843 -5.442585 ]\n",
      " [12.727866  -9.012307 ]\n",
      " [13.442886  -8.853996 ]\n",
      " [ 5.5039425 -6.468625 ]\n",
      " [13.856337  -7.8624372]\n",
      " [ 5.3917537 -5.081037 ]\n",
      " [12.293061  -8.679683 ]\n",
      " [ 4.640278  -5.9602914]\n",
      " [13.792018  -8.339484 ]\n",
      " [ 4.832395  -5.329006 ]\n",
      " [12.932469  -8.515838 ]\n",
      " [ 6.287607  -6.1892047]\n",
      " [13.300953  -8.028255 ]\n",
      " [ 5.438404  -5.627225 ]\n",
      " [13.618681  -7.5132465]\n",
      " [ 5.104162  -5.9107757]]\n"
     ]
    }
   ],
   "source": [
    "# 转换为低维表示\n",
    "reduced_embeddings = umap_model.fit_transform(embeddings)\n",
    "print(reduced_embeddings.shape, reduced_embeddings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.collections.PathCollection at 0x1a981b24af0>"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAi8AAAGdCAYAAADaPpOnAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAwWElEQVR4nO3df3RU9Z3/8dckwAzaZExCkpmsARN0iTH+4MeSQtkeF0OJq1FW1h6rgFiEkuP6A1I0bMUYtUbkiK5tRe2hVA7anvVUCqFn40pAq6eBuETEbCQgBkWYhFOQGX40MWTu9w++mWXIDzI4NzN35vk45/5x73xu5j1nSufl5/O5n4/NMAxDAAAAFpEQ6QIAAABCQXgBAACWQngBAACWQngBAACWQngBAACWQngBAACWQngBAACWQngBAACWMiTSBYSb3+/XoUOHlJSUJJvNFulyAADAABiGoePHjysrK0sJCf33rcRceDl06JCys7MjXQYAALgABw4c0KWXXtpvm5gLL0lJSZLOfPjk5OQIVwMAAAbC5/MpOzs78Dven5gLL91DRcnJyYQXAAAsZiBTPpiwCwAALIXwAgAALMXU8HLZZZfJZrMFHc8880y/97S3t+u+++5TWlqavvOd72jmzJlqa2szs0wAAGAhpve8PPHEE/J4PIHj/vvv77f9okWLVF1drTfffFPvvfeeDh06pNtuu83sMgEAgEWYPmE3KSlJLpdrQG29Xq9Wr16tN954Q1OnTpUkrVmzRldeeaW2bdum7373u2aWCgAALMD0npdnnnlGaWlpGjt2rFasWKHTp0/32XbHjh3q7OxUUVFR4FpeXp5Gjhypuro6s0sFAAAWYGrPywMPPKBx48YpNTVVf/nLX7R06VJ5PB6tXLmy1/atra0aNmyYLrnkkqDrmZmZam1t7fWejo4OdXR0BM59Pl/Y6gcAANEn5J6X8vLyHpNwzz12794tSVq8eLGuv/56XXPNNVq4cKGee+45/eIXvwgKG99WVVWVnE5n4GB1XQAAYlvIPS9lZWWaO3duv21yc3N7vV5YWKjTp09r//79GjNmTI/XXS6XvvnmGx07diyo96Wtra3PeTNLly7V4sWLA+fdK/TFgy6/ofqWozp8vF0ZSQ5NzElVYgL7OQEAYlvI4SU9PV3p6ekX9GY7d+5UQkKCMjIyen19/PjxGjp0qGprazVz5kxJUnNzs7788ktNmjSp13vsdrvsdvsF1WNlNY0eVVY3yeNtD1xzOx2qKMlXcYE7gpUBAGAu0ybs1tXV6YUXXtDHH3+szz//XK+//roWLVqkWbNmKSUlRZJ08OBB5eXlqb6+XpLkdDo1b948LV68WFu3btWOHTt0zz33aNKkSTxpdJaaRo9K1zUEBRdJavW2q3Rdg2oaPRGqDAAA85k2Yddut+v3v/+9Hn/8cXV0dCgnJ0eLFi0KGuLp7OxUc3OzTp06Fbj2/PPPKyEhQTNnzlRHR4emT5+ul156yawyLafLb6iyuklGL68ZkmySKqubNC3fxRASACAm2QzD6O130LJ8Pp+cTqe8Xm9MbsxYt++IfvTrbedt97v539Wk0WmDUBEAAN9eKL/f7G1kMYePt5+/UQjtAACwGsKLxWQkOcLaDgAAqyG8WMzEnFS5nQ71NZvFpjNPHU3MSR3MsgAAGDSEF4tJTLCpoiRfknoEmO7zipJ8JusCAGIW4cWCigvcWjVrnFzO4KEhl9OhVbPGsc4LACCmmb6rNMxRXODWtHwXK+wCAOIO4cXCEhNsPA4NAIg7DBsBAABLIbwAAABLIbwAAABLIbwAAABLIbwAAABLIbwAAABLIbwAAABLIbwAAABLIbwAAABLIbwAAABLYXuAMOryG+w1BACAyQgvYVLT6FFldZM83vbANbfToYqSfHZ5BgAgjBg2CoOaRo9K1zUEBRdJavW2q3Rdg2oaPRGqDACA2EN4+Za6/IYqq5tk9PJa97XK
6iZ1+XtrAQAAQkV4+ZbqW4726HE5myHJ421XfcvRwSsKAIAYRnj5lg4f7zu4XEg7AADQP8LLt5SR5AhrOwAA0D/Cy7c0MSdVbqdDfT0QbdOZp44m5qQOZlkAAMQswsu3lJhgU0VJviT1CDDd5xUl+az3AgBAmBBewqC4wK1Vs8bJ5QweGnI5HVo1a1xMr/PS5TdUt++INuw8qLp9R3iqCgBgOhapC5PiArem5bviaoVdFuYDAESCzTCMmPpPZZ/PJ6fTKa/Xq+Tk5EiXE7O6F+Y793883VEt1nucAADhFcrvN8NGCBkL8wEAIonwgpCxMB8AIJIILwgZC/MBACLJ1PBy2WWXyWazBR3PPPNMv/dcf/31Pe5ZuHChmWUiRCzMBwCIJNOfNnriiSc0f/78wHlSUtJ575k/f76eeOKJwPlFF11kSm24MN0L87V623ud92LTmcfEWZgPAGAG08NLUlKSXC5XSPdcdNFFId+DwdO9MF/pugbZpKAAw8J8AACzmT7n5ZlnnlFaWprGjh2rFStW6PTp0+e95/XXX9eIESNUUFCgpUuX6tSpU3227ejokM/nCzpgvnhemA8AEFmm9rw88MADGjdunFJTU/WXv/xFS5culcfj0cqVK/u8584779SoUaOUlZWlXbt26ZFHHlFzc7PeeuutXttXVVWpsrLSrI+AfsTjwnwAgMgLeZG68vJyLV++vN82n376qfLy8npc/81vfqOf/OQnOnHihOx2+4Deb8uWLbrhhhv02WefafTo0T1e7+joUEdHR+Dc5/MpOzubReoAALCQUBapC7nnpaysTHPnzu23TW5ubq/XCwsLdfr0ae3fv19jxowZ0PsVFhZKUp/hxW63DzgIAQAA6ws5vKSnpys9Pf2C3mznzp1KSEhQRkZGSPdIktvNHAoAAGDinJe6ujpt375d//RP/6SkpCTV1dVp0aJFmjVrllJSUiRJBw8e1A033KC1a9dq4sSJ2rdvn9544w398z//s9LS0rRr1y4tWrRI3//+93XNNdeYVSoAALAQ08KL3W7X73//ez3++OPq6OhQTk6OFi1apMWLFwfadHZ2qrm5OfA00bBhw7R582a98MILOnnypLKzszVz5kw9+uijZpUJAAAshl2lY0CX3+CJHwCApZk6YRfRpabRo8rqpqCNEt1OhypK8llrBQAQk9iY0cJqGj0qXdfQY4fnVm+7Stc1qKbRE6HKAAAwD+HForr8hiqrm3rdW6j7WmV1k7r8MTUqCAAA4cWq6luO9uhxOZshyeNtV33L0cErCgCAQUB4sajDx/sOLhfSDgAAqyC8WFRGkuP8jUJoBwCAVRBeLGpiTqrcTof6eiDapjNPHU3MSR3MsgAAMB3hxaISE2yqKMmXpB4Bpvu8oiSf9V4AADGH8GJhxQVurZo1Ti5n8NCQy+nQqlnjWOcFABCTWKTO4ooL3JqW72KFXQBA3CC8xIDEBJsmjU6LdBkAAAwKho0AAIClEF4AAIClEF4AAIClEF4AAIClEF4AAIClEF4AAIClEF4AAIClEF4AAIClEF4AAIClEF4AAIClEF4AAIClEF4AAIClsDFjFOjyG+wKDQDAABFeIqym0aPK6iZ5vO2Ba26nQxUl+SoucEewMgAAohPDRhFU0+hR6bqGoOAiSa3edpWua1BNoydClQEAEL0ILxHS5TdUWd0ko5fXuq9VVjepy99bCwAA4hfhJULqW4726HE5myHJ421XfcvRwSsKAAALYM7LAIV7Uu3h430HlwtpBwBAvCC8DIAZk2ozkhxhbQcAQLxg2Og8zJpUOzEnVW6nQ3313dh0JiBNzEm9oL//bXX5DdXtO6INOw+qbt8R5t4AAKIGPS/9ON+kWpvOTKqdlu8KeQgpMcGmipJ8la5rkE0Keo/uv1RRkh+R9V54fBsAEM1M7Xn505/+pMLCQg0fPlwpKSmaMWNGv+0Nw9Bjjz0mt9ut4cOHq6ioSHv37jWzxH6ZPam2uMCt
VbPGyeUMHhpyOR1aNWtcRIICj28DAKKdaT0vf/jDHzR//nw9/fTTmjp1qk6fPq3GxsZ+73n22Wf14osv6rXXXlNOTo6WLVum6dOnq6mpSQ7H4M/9GIxJtcUFbk3Ld0XFCrtm9jQBABAupoSX06dP68EHH9SKFSs0b968wPX8/Pw+7zEMQy+88IIeffRR3XrrrZKktWvXKjMzU3/84x91xx13mFFqvwZrUm1igk2TRqd9q78RDqH0NEVDvQCA+GTKsFFDQ4MOHjyohIQEjR07Vm63WzfeeGO/PS8tLS1qbW1VUVFR4JrT6VRhYaHq6ur6vK+jo0M+ny/oCJdon1Qbbjy+DQCwAlPCy+effy5Jevzxx/Xoo49q06ZNSklJ0fXXX6+jR3ufH9La2ipJyszMDLqemZkZeK03VVVVcjqdgSM7OztMn+L/JtVK6hFgIj2p1gw8vg0AsIKQwkt5eblsNlu/x+7du+X3+yVJP/vZzzRz5kyNHz9ea9askc1m05tvvhnWD7B06VJ5vd7AceDAgbD+/WicVGuWeOtpAgBYU0hzXsrKyjR37tx+2+Tm5srjOfNEytlzXOx2u3Jzc/Xll1/2ep/L5ZIktbW1ye3+v0DQ1tam6667rs/3s9vtstvtA/wEFyaaJtWaKZof3wYAoFtI4SU9PV3p6ennbTd+/HjZ7XY1NzdrypQpkqTOzk7t379fo0aN6vWenJwcuVwu1dbWBsKKz+fT9u3bVVpaGkqZpoiWSbVm6+5pOnedFxfrvAAAooQpTxslJydr4cKFqqioUHZ2tkaNGqUVK1ZIkm6//fZAu7y8PFVVVelf/uVfZLPZ9NBDD+mpp57SFVdcEXhUOisr67zrwyC84qWnCQBgTaat87JixQoNGTJEs2fP1t/+9jcVFhZqy5YtSklJCbRpbm6W1+sNnD/88MM6efKkFixYoGPHjmnKlCmqqamJyBov8S5eepoAANZjMwwjpjat8fl8cjqd8nq9Sk5OjnQ5AABgAEL5/WZjRgAAYCmEFwAAYCmEFwAAYCmEFwAAYCmmPW0EnK3Lb/DoNQAgLAgvMF1No6fHonduFr0DAFwgho1gqppGj0rXNQQFF0lq9bardF2Daho9EaoMAGBVhBeYpstvqLK6Sb0tJNR9rbK6SV3+mFpqCABgMsILTFPfcrRHj8vZDEkeb7vqW44OXlEAAMsjvMA0h4/3HVwupB0AABLhBSbKSBrYnlQDbQcAgER4gYkm5qTK7XSorweibTrz1NHEnNTBLAsAYHGEF5gmMcGmipJ8SeoRYLrPK0ryWe8FABASwgtMVVzg1qpZ4+RyBg8NuZwOrZo1jnVeAAAhY5E6mK64wK1p+S5W2AUAhAXhBYMiMcGmSaPTIl0GACAGMGwEAAAshfACAAAshfACAAAshfACAAAshfACAAAshfACAAAshfACAAAshfACAAAshfACAAAshfACAAAshfACAAAshfACAAAshY0ZY0yX32D3ZgBATCO8xJCaRo8qq5vk8bYHrrmdDlWU5Ku4wB3BygAACB+GjWJETaNHpesagoKLJLV621W6rkE1jZ4IVQYAQHgRXmJAl99QZXWTjF5e675WWd2kLn9vLQAAsBZTw8uf/vQnFRYWavjw4UpJSdGMGTP6bT937lzZbLago7i42MwSY0J9y9EePS5nMyR5vO2qbzk6eEUBAGAS0+a8/OEPf9D8+fP19NNPa+rUqTp9+rQaGxvPe19xcbHWrFkTOLfb7WaVGDMOH+87uFxIOwAAopkp4eX06dN68MEHtWLFCs2bNy9wPT8//7z32u12uVwuM8qKWRlJjrC2AwAgmpkybNTQ0KCDBw8qISFBY8eOldvt1o033jignpd3331XGRkZGjNmjEpLS3XkyJF+23d0dMjn8wUd8WZiTqrcTof6eiDapjNPHU3MSR3MsgAAMIUp4eXzzz+XJD3++ON69NFHtWnTJqWkpOj666/X0aN9z7soLi7W2rVrVVtb
q+XLl+u9997TjTfeqK6urj7vqaqqktPpDBzZ2dlh/zzRLjHBpoqSM71a5waY7vOKknzWewEAxASbYRgDfgSlvLxcy5cv77fNp59+qoaGBt1111165ZVXtGDBAklnekguvfRSPfXUU/rJT34yoPf7/PPPNXr0aG3evFk33HBDr206OjrU0dEROPf5fMrOzpbX61VycvIAP1lsYJ0XAIBV+Xw+OZ3OAf1+hzTnpaysTHPnzu23TW5urjyeM2uKnD3HxW63Kzc3V19++eWA3y83N1cjRozQZ5991md4sdvtTOr9/4oL3JqW72KFXQBATAspvKSnpys9Pf287caPHy+73a7m5mZNmTJFktTZ2an9+/dr1KhRA36/r776SkeOHJHbTa/BQCUm2DRpdFqkywAAwDSmzHlJTk7WwoULVVFRof/+7/9Wc3OzSktLJUm33357oF1eXp7Wr18vSTpx4oSWLFmibdu2af/+/aqtrdWtt96qyy+/XNOnTzejTAAAYEGmrfOyYsUKDRkyRLNnz9bf/vY3FRYWasuWLUpJSQm0aW5ultfrlSQlJiZq165deu2113Ts2DFlZWXpBz/4gZ588kmGhQAAQEBIE3atIJQJPwAAIDqE8vvN3kYAAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSTAsv7777rmw2W6/Hhx9+2Od97e3tuu+++5SWlqbvfOc7mjlzptra2swqEwAAWIxp4WXy5MnyeDxBx7333qucnBxNmDChz/sWLVqk6upqvfnmm3rvvfd06NAh3XbbbWaVCQAALGaIWX942LBhcrlcgfPOzk5t2LBB999/v2w2W6/3eL1erV69Wm+88YamTp0qSVqzZo2uvPJKbdu2Td/97nfNKhcAAFjEoM152bhxo44cOaJ77rmnzzY7duxQZ2enioqKAtfy8vI0cuRI1dXV9XpPR0eHfD5f0AEAAGLXoIWX1atXa/r06br00kv7bNPa2qphw4bpkksuCbqemZmp1tbWXu+pqqqS0+kMHNnZ2eEsGwAARJmQw0t5eXmfE3G7j927dwfd89VXX+ntt9/WvHnzwlZ4t6VLl8rr9QaOAwcOhP09AABA9Ah5zktZWZnmzp3bb5vc3Nyg8zVr1igtLU233HJLv/e5XC598803OnbsWFDvS1tbW9D8mbPZ7XbZ7fYB1Q4AAKwv5PCSnp6u9PT0Abc3DENr1qzRnDlzNHTo0H7bjh8/XkOHDlVtba1mzpwpSWpubtaXX36pSZMmhVoqAACIQabPedmyZYtaWlp077339njt4MGDysvLU319vSTJ6XRq3rx5Wrx4sbZu3aodO3bonnvu0aRJk3jSCAAASDLxUeluq1ev1uTJk5WXl9fjtc7OTjU3N+vUqVOBa88//7wSEhI0c+ZMdXR0aPr06XrppZfMLhMAAFiEzTAMI9JFhJPP55PT6ZTX61VycnKkywEAAAMQyu83exsBAABLIbwAAABLIbwAAABLIbwAAABLIbwAAABLIbwAAABLIbwAAABLIbwAAABLIbwAAABLMX17AAAAcH5dfkP1LUd1+Hi7MpIcmpiTqsQEW6TLikqEFwAAIqym0aPK6iZ5vO2Ba26nQxUl+SoucEewsujEsBEAABFU0+hR6bqGoOAiSa3edpWua1BNoydClUUvwgsAABHS5TdUWd2k3nZI7r5WWd2kLn9M7aH8rRFeAACIkPqWoz16XM5mSPJ421XfcnTwirIAwgsAABFy+HjfweVC2sULwgsAABGSkeQIa7t4QXgBACBCJuakyu10
qK8Hom0689TRxJzUwSwr6hFeAACIkMQEmypK8iWpR4DpPq8oyWe9l3MQXgAAiKDiArdWzRonlzN4aMjldGjVrHGs89ILFqkDACDCigvcmpbvYoXdASK8AAAQBRITbJo0Oi3SZVgCw0YAAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSCC8AAMBSTAsv7777rmw2W6/Hhx9+2Od9119/fY/2CxcuNKtMAADiVpffUN2+I9qw86Dq9h1Rl9+IdEkDYtrGjJMnT5bH4wm6tmzZMtXW1mrChAn93jt//nw98cQTgfOLLrrIlBoBAIhXNY0eVVY3yeNtD1xzOx2qKMlXcYE7gpWdn2nhZdiwYXK5XIHzzs5ObdiwQffff79stv63+L7ooouC7gUAAOFT0+hR6boGndvP0uptV+m6Bq2aNS6qA8ygzXnZuHGjjhw5onvuuee8bV9//XWNGDFCBQUFWrp0qU6dOtVn246ODvl8vqADAAD0rstvqLK6qUdwkRS4VlndFNVDSKb1vJxr9erVmj59ui699NJ+2915550aNWqUsrKytGvXLj3yyCNqbm7WW2+91Wv7qqoqVVZWmlEyAAAxp77laNBQ0bkMSR5vu+pbjmrS6LTBKywEIfe8lJeX9zkRt/vYvXt30D1fffWV3n77bc2bN++8f3/BggWaPn26rr76at11111au3at1q9fr3379vXafunSpfJ6vYHjwIEDoX4kAADixuHjfQeXC2kXCSH3vJSVlWnu3Ln9tsnNzQ06X7NmjdLS0nTLLbeE+nYqLCyUJH322WcaPXp0j9ftdrvsdnvIfxcAgHiUkeQIa7tICDm8pKenKz09fcDtDcPQmjVrNGfOHA0dOjTUt9POnTslSW539E4cAgDAKibmpMrtdKjV297rvBebJJfToYk5qYNd2oCZPmF3y5Ytamlp0b333tvjtYMHDyovL0/19fWSpH379unJJ5/Ujh07tH//fm3cuFFz5szR97//fV1zzTVmlwoAQMxLTLCpoiRf0pmgcrbu84qSfCUm9P9kcCSZHl5Wr16tyZMnKy8vr8drnZ2dam5uDjxNNGzYMG3evFk/+MEPlJeXp7KyMs2cOVPV1dVmlwkAQNwoLnBr1axxcjmDh4ZcTkfUPyYtSTbDMKL3WagL4PP55HQ65fV6lZycHOlyAACIWl1+Q/UtR3X4eLsyks4MFUWqxyWU3+9Be1QaAIB4FE0B4VyJCbaofRy6P4QXAABMYuUl+KMZu0oDAGCC7iX4z10QrnsJ/ppGTx934nwILwAAhFksLMEfzQgvAACEWShL8CN0hBcAAMIsFpbgj2aEFwAAwiwWluCPZoQXAADCrHsJ/r4eiLbpzFNH0bwEfzQjvAAAEGaxsAR/NCO8AABgAqsvwR/NWKQOAACTFBe4NS3fFbUr7FoV4QUAABNZdQn+aMawEQAAsBTCCwAAsBTCCwAAsBTCCwAAsBTCCwAAsBTCCwAAsBTCCwAAsBTWeQEAAAPS5TeiYsE9wgsAADivmkaPKqub5PG2B665nQ5VlOQP+lYHDBsBAIB+1TR6VLquISi4SFKrt12l6xpU0+gZ1HoILwAAoE9dfkOV1U0yenmt+1pldZO6/L21MAfhBQAA9Km+5WiPHpezGZI83nbVtxwdtJoILwAAoE+Hj/cdXC6kXTgQXgAAQJ8ykhxhbRcOhBcAANCniTmpcjsd6uuBaJvOPHU0MSd10GoivAAAgD4lJthUUZIvST0CTPd5RUn+oK73QngBAAD9Ki5wa9WscXI5g4eGXE6HVs0aN+jrvLBIHQAAOK/iArem5btYYRcAAFhHYoJNk0anRboMho0AAIC1mBZe9uzZo1tvvVUjRoxQcnKypkyZoq1bt/Z7j2EYeuyxx+R2uzV8+HAVFRVp7969ZpUIAAAsyLTwcvPNN+v06dPasmWLduzYoWuvvVY333yzWltb
+7zn2Wef1YsvvqiXX35Z27dv18UXX6zp06ervX3wFr4BAMBKuvyG6vYd0YadB1W378igLtMfKTbDMML+Kf/6178qPT1df/7zn/WP//iPkqTjx48rOTlZ77zzjoqKinrcYxiGsrKyVFZWpp/+9KeSJK/Xq8zMTP32t7/VHXfcMaD39vl8cjqd8nq9Sk5ODt+HAgAgykTTTs/fVii/36b0vKSlpWnMmDFau3atTp48qdOnT+uVV15RRkaGxo8f3+s9LS0tam1tDQo2TqdThYWFqqur6/O9Ojo65PP5gg4AAGJdtO30PJhMCS82m02bN2/WRx99pKSkJDkcDq1cuVI1NTVKSUnp9Z7u4aTMzMyg65mZmf0ONVVVVcnpdAaO7Ozs8H0QAACiUDTu9DyYQgov5eXlstls/R67d++WYRi67777lJGRoffff1/19fWaMWOGSkpK5PGENwkuXbpUXq83cBw4cCCsfx8AgGgTjTs9D6aQ1nkpKyvT3Llz+22Tm5urLVu2aNOmTfr6668D41YvvfSS3nnnHb322msqLy/vcZ/L5ZIktbW1ye3+v3G6trY2XXfddX2+n91ul91uD+VjAABgadG40/NgCim8pKenKz09/bztTp06JUlKSAju2ElISJDf7+/1npycHLlcLtXW1gbCis/n0/bt21VaWhpKmQAAxLRo3Ol5MJky52XSpElKSUnR3XffrY8//lh79uzRkiVL1NLSoptuuinQLi8vT+vXr5d0Zp7MQw89pKeeekobN27UJ598ojlz5igrK0szZswwo0wAACwpGnd6HkymhJcRI0aopqZGJ06c0NSpUzVhwgR98MEH2rBhg6699tpAu+bmZnm93sD5ww8/rPvvv18LFizQP/zDP+jEiROqqamRwxGbyREAgAsRjTs9DyZT1nmJJNZ5AQDEi3hd54WNGQEA0JnHj6Nhx+RQRNNOz4OJ8AIAiHtW7sGIlp2eBxO7SgMA4lo8r1RrVYQXAEDciveVaq2K8AIAiFvxvlKtVRFeAABxK95XqrUqwgsAIG7F+0q1VkV4AQDErXhfqdaqCC8AgLgV7yvVWhXhBQAQ14oL3Fo1a5xczuChIZfToVWzxkX9Oi/xiEXqAABxL15XqrUqwgsAAIrPlWqtimEjAABgKYQXAABgKYQXAABgKYQXAABgKYQXAABgKTxtBAAA+tTlN6LuEXLCCwAA6FVNo0eV1U1BO2+7nQ5VlORHdPE+ho0AAEAPNY0ela5rCAouktTqbVfpugbVNHoiVBnhBQAAnKPLb6iyuklGL691X6usblKXv7cW5iO8AACAIPUtR3v0uJzNkOTxtqu+5ejgFXUWwgsAAAhy+HjfweVC2oUb4QUAAATJSHKcv1EI7cKN8AIAAIJMzEmV2+lQXw9E23TmqaOJOamDWVYA4QUAAARJTLCpoiRfknoEmO7zipL8iK33QngBAAA9FBe4tWrWOLmcwUNDLqdDq2aNi+g6LyxSBwAAelVc4Na0fBcr7AIAAOtITLBp0ui0SJcRhGEjAABgKYQXAABgKYQXAABgKaaFlz179ujWW2/ViBEjlJycrClTpmjr1q393jN37lzZbLago7i42KwSAQCImC6/obp9R7Rh50HV7TsSsX2CrMi0Cbs333yzrrjiCm3ZskXDhw/XCy+8oJtvvln79u2Ty+Xq877i4mKtWbMmcG63280qEQCAiKhp9Kiyuilo/yC306GKkvyIPoJsFab0vPz1r3/V3r17VV5ermuuuUZXXHGFnnnmGZ06dUqNjY393mu32+VyuQJHSkqKGSUCABARNY0ela5r6LHxYau3XaXrGlTT6IlQZdZhSnhJS0vTmDFjtHbtWp08eVKnT5/WK6+8ooyMDI0fP77fe999911lZGRozJgxKi0t1ZEjR/pt39HRIZ/PF3QAABCNuvyGKqub1NsAUfe1yuomhpDOw5TwYrPZtHnzZn300UdKSkqSw+HQypUrVVNT029PSnFxsdau
Xava2lotX75c7733nm688UZ1dXX1eU9VVZWcTmfgyM7ONuMjAQDwrdW3HO3R43I2Q5LH2676lqODV5QFhRReysvLe0yoPffYvXu3DMPQfffdp4yMDL3//vuqr6/XjBkzVFJSIo+n7+6wO+64Q7fccouuvvpqzZgxQ5s2bdKHH36od999t897li5dKq/XGzgOHDgQykcCAGDQHD7ed3C5kHbxKqQJu2VlZZo7d26/bXJzc7VlyxZt2rRJX3/9tZKTkyVJL730kt555x299tprKi8vH9D75ebmasSIEfrss890ww039NrGbrczqRcAYAkZSY7zNwqhXbwKKbykp6crPT39vO1OnTolSUpICO7YSUhIkN/vH/D7ffXVVzpy5IjcbmZeAwCsb2JOqtxOh1q97b3Oe7HpzMaHE3NSB7s0SzFlzsukSZOUkpKiu+++Wx9//LH27NmjJUuWqKWlRTfddFOgXV5entavXy9JOnHihJYsWaJt27Zp//79qq2t1a233qrLL79c06dPN6NMAAAGVWKCTRUl+ZLOBJWzdZ9XlORHfOPDaGdKeBkxYoRqamp04sQJTZ06VRMmTNAHH3ygDRs26Nprrw20a25ultfrlSQlJiZq165duuWWW/T3f//3mjdvnsaPH6/333+fYSEAQMwoLnBr1axxcjmDh4ZcTodWzRrHOi8DYDMMI6aex/L5fHI6nfJ6vYH5NgAARJsuv6H6lqM6fLxdGUlnhoriuccllN9v01bYBQAAfUtMsGnS6LRIl2FJbMwIAAAshfACAAAshWEjAEBcYs6JdRFeAABxh12drY1hIwBAXGFXZ+sjvAAA4ga7OscGwgsAIG6wq3NsILwAAOIGuzrHBsILACBusKtzbCC8AADiRveuzn09EG3TmaeO2NU5uhFeAABxg12dYwPhBQAQV9jV2fpYpA4AEHeKC9yalu9ihV2LIrwAAOISuzpbF8NGAADAUggvAADAUggvAADAUpjzAgBAjOryGzE5KZnwAgBADKpp9KiyuiloLye306GKknzLPw7OsBEAADGmptGj0nUNPTahbPW2q3Rdg2oaPRGqLDwILwAAxJAuv6HK6iYZvbzWfa2yukld/t5aWAPhBQCAGFLfcrRHj8vZDEkeb7vqW44OXlFhRngBACCGHD7ed3C5kHbRiPACAEAMyUhynL9RCO2iEeEFAIAYMjEnVW6no8eu2d1sOvPU0cSc1MEsK6wILwAAxJDEBJsqSvIlqUeA6T6vKMm39HovhBcAAGJMcYFbq2aNk8sZPDTkcjq0atY4y6/zwiJ1AADEoOICt6blu1hhFwAAWEdigk2TRqdFuoywY9gIAABYimnhpaGhQdOmTdMll1yitLQ0LViwQCdOnOj3HsMw9Nhjj8ntdmv48OEqKirS3r17zSoRAICw6/Ibqtt3RBt2HlTdviOWXsk2WpkSXg4dOqSioiJdfvnl2r59u2pqavS///u/mjt3br/3Pfvss3rxxRf18ssva/v27br44os1ffp0tbdbdyEdAED8qGn0aMryLfrRr7fpwd/v1I9+vU1Tlm+x/F5C0cZmGEbYI+Grr76qZcuWyePxKCHhTD765JNPdM0112jv3r26/PLLe9xjGIaysrJUVlamn/70p5Ikr9erzMxM/fa3v9Udd9wxoPf2+XxyOp3yer1KTk4O34cCAKAf3Zshnvuj2j09Nhae8jFTKL/fpvS8dHR0aNiwYYHgIknDhw+XJH3wwQe93tPS0qLW1lYVFRUFrjmdThUWFqqurq7f9/L5fEEHAACDKR42Q4wmpoSXqVOnqrW1VStWrNA333yjr7/+WuXl5ZIkj6f3rrPW1lZJUmZmZtD1zMzMwGu9qaqqktPpDBzZ2dlh+hQAAAxMPGyGGE1CCi/l5eWy2Wz9Hrt379ZVV12l1157Tc8995wuuugiuVwu5eTkKDMzM6g3JhyWLl0qr9cbOA4cOBDWvw8AwPnEw2aI0SSkdV7KysrOO+k2
NzdXknTnnXfqzjvvVFtbmy6++GLZbDatXLky8Pq5XC6XJKmtrU1u9/+NCba1tem6667r8/3sdrvsdnsoHwMAgLCKh80Qo0lI4SU9PV3p6ekhvUH3MNBvfvMbORwOTZs2rdd2OTk5crlcqq2tDYQVn8+n7du3q7S0NKT3BABgMHVvhtjqbe913otNZ5bmt/JmiNHEtHVefvnLX6qhoUF79uzRr371K/3bv/2bqqqqdMkllwTa5OXlaf369ZIkm82mhx56SE899ZQ2btyoTz75RHPmzFFWVpZmzJhhVpkAAHxr8bAZYjQxbXuA+vp6VVRU6MSJE8rLy9Mrr7yi2bNnB7Vpbm6W1+sNnD/88MM6efKkFixYoGPHjmnKlCmqqamRw0E3GwAgunVvhlhZ3RQ0edfldKiiJJ/HpMPIlHVeIol1XgAAkdTlN2JyM0SzhfL7zcaMAACEUaxuhhhN2JgRAABYCuEFAABYCuEFAABYCuEFAABYCuEFAABYCuEFAABYCuEFAABYCuEFAABYCuEFAABYSsytsNu924HP54twJQAAYKC6f7cHsmtRzIWX48ePS5Kys7MjXAkAAAjV8ePH5XQ6+20Tcxsz+v1+HTp0SElJSbLZ2AjLDD6fT9nZ2Tpw4ACbX0YQ30N04HuIDnwP0eHbfA+GYej48ePKyspSQkL/s1piruclISFBl156aaTLiAvJycn8n0QU4HuIDnwP0YHvITpc6Pdwvh6XbkzYBQAAlkJ4AQAAlkJ4QcjsdrsqKipkt9sjXUpc43uIDnwP0YHvIToM1vcQcxN2AQBAbKPnBQAAWArhBQAAWArhBQAAWArhBQAAWArhBSE5ePCgZs2apbS0NA0fPlxXX321/ud//ifSZcWVrq4uLVu2TDk5ORo+fLhGjx6tJ598ckD7geDC/fnPf1ZJSYmysrJks9n0xz/+Meh1wzD02GOPye12a/jw4SoqKtLevXsjU2wM6+976Ozs1COPPKKrr75aF198sbKysjRnzhwdOnQocgXHqPP9ezjbwoULZbPZ9MILL4Tt/QkvGLCvv/5a3/ve9zR06FD913/9l5qamvTcc88pJSUl0qXFleXLl2vVqlX65S9/qU8//VTLly/Xs88+q1/84heRLi2mnTx5Utdee61+9atf9fr6s88+qxdffFEvv/yytm/frosvvljTp09Xe3v7IFca2/r7Hk6dOqWGhgYtW7ZMDQ0Neuutt9Tc3KxbbrklApXGtvP9e+i2fv16bdu2TVlZWeEtwAAG6JFHHjGmTJkS6TLi3k033WT8+Mc/Drp22223GXfddVeEKoo/koz169cHzv1+v+FyuYwVK1YErh07dsyw2+3G7373uwhUGB/O/R56U19fb0gyvvjii8EpKg719T189dVXxt/93d8ZjY2NxqhRo4znn38+bO9JzwsGbOPGjZowYYJuv/12ZWRkaOzYsfr1r38d6bLizuTJk1VbW6s9e/ZIkj7++GN98MEHuvHGGyNcWfxqaWlRa2urioqKAtecTqcKCwtVV1cXwcrg9Xpls9l0ySWXRLqUuOL3+zV79mwtWbJEV111Vdj/fsxtzAjzfP7551q1apUWL16sf//3f9eHH36oBx54QMOGDdPdd98d6fLiRnl5uXw+n/Ly8pSYmKiuri79/Oc/11133RXp0uJWa2urJCkzMzPoemZmZuA1DL729nY98sgj+tGPfsRmjYNs+fLlGjJkiB544AFT/j7hBQPm9/s1YcIEPf3005KksWPHqrGxUS+//DLhZRD953/+p15//XW98cYbuuqqq7Rz50499NBDysrK4nsA/r/Ozk798Ic/lGEYWrVqVaTLiSs7duzQf/zHf6ihoUE2m82U92DYCAPmdruVn58fdO3KK6/Ul19+GaGK4tOSJUtUXl6uO+64Q1dffbVmz56tRYsWqaqqKtKlxS2XyyVJamtrC7re1tYWeA2Dpzu4fPHFF3rnnXfodRlk77//vg4fPqyRI0dq
yJAhGjJkiL744guVlZXpsssuC8t7EF4wYN/73vfU3NwcdG3Pnj0aNWpUhCqKT6dOnVJCQvA/3cTERPn9/ghVhJycHLlcLtXW1gau+Xw+bd++XZMmTYpgZfGnO7js3btXmzdvVlpaWqRLijuzZ8/Wrl27tHPnzsCRlZWlJUuW6O233w7LezBshAFbtGiRJk+erKefflo//OEPVV9fr1dffVWvvvpqpEuLKyUlJfr5z3+ukSNH6qqrrtJHH32klStX6sc//nGkS4tpJ06c0GeffRY4b2lp0c6dO5WamqqRI0fqoYce0lNPPaUrrrhCOTk5WrZsmbKysjRjxozIFR2D+vse3G63/vVf/1UNDQ3atGmTurq6AnOOUlNTNWzYsEiVHXPO9+/h3NA4dOhQuVwujRkzJjwFhO25JcSF6upqo6CgwLDb7UZeXp7x6quvRrqkuOPz+YwHH3zQGDlypOFwOIzc3FzjZz/7mdHR0RHp0mLa1q1bDUk9jrvvvtswjDOPSy9btszIzMw07Ha7ccMNNxjNzc2RLToG9fc9tLS09PqaJGPr1q2RLj2mnO/fw7nC/ai0zTBYlhMAAFgHc14AAIClEF4AAIClEF4AAIClEF4AAIClEF4AAIClEF4AAIClEF4AAIClEF4AAIClEF4AAIClEF4AAIClEF4AAIClEF4AAICl/D/q0dLDgaGkCQAAAABJRU5ErkJggg==",
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. 聚类"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],\n",
       "      dtype=int64)"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from hdbscan import HDBSCAN\n",
    "hdbscan_model = HDBSCAN(min_cluster_size=2)\n",
    "hdbscan_model.fit(reduced_embeddings)\n",
    "hdbscan_model.labels_"
   ]
  },
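  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before counting words per cluster, the labels have to be turned into groups of sentence indices. The following is a small sketch of that step, assuming the label array printed above; `group_indices` is a hypothetical helper, not part of hdbscan or BERTopic. HDBSCAN marks noise points with the label -1, so the sketch skips them. Because UMAP is stochastic when no `random_state` is set, the cluster ids may come out differently on another run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "# Labels copied from the HDBSCAN output above\n",
    "labels = [1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]\n",
    "\n",
    "def group_indices(labels):\n",
    "    # Map each cluster label to the list of sentence indices assigned to it\n",
    "    groups = defaultdict(list)\n",
    "    for idx, label in enumerate(labels):\n",
    "        if label == -1:  # HDBSCAN noise point: belongs to no cluster\n",
    "            continue\n",
    "        groups[int(label)].append(idx)\n",
    "    return dict(groups)\n",
    "\n",
    "clusters = group_indices(labels)\n",
    "for label in sorted(clusters):\n",
    "    print(label, clusters[label])"
   ]
  },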
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4. 统计每个类的词频"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "128 ['across' 'advanced' 'affecting' 'agriculture' 'aid' 'alexa' 'algorithms'\n",
      " 'an' 'analysis' 'analyze' 'and' 'are' 'areas' 'as' 'aspect' 'assistants'\n",
      " 'be' 'bert' 'by' 'can' 'change' 'chatbots' 'clear' 'climate' 'coastal'\n",
      " 'commands' 'communication' 'complex' 'computer' 'crucial' 'currents'\n",
      " 'data' 'disaster' 'during' 'easier' 'el' 'especially' 'evenings' 'events'\n",
      " 'extreme' 'field' 'for' 'forecast' 'forecasting' 'fundamental' 'generate'\n",
      " 'global' 'gpt' 'have' 'heavy' 'help' 'helps' 'human' 'hurricanes'\n",
      " 'impact' 'important' 'in' 'influenced' 'is' 'language' 'languages' 'like'\n",
      " 'love' 'machine' 'making' 'management' 'media' 'meteorologists' 'mild'\n",
      " 'models' 'natural' 'niño' 'nlp' 'ocean' 'of' 'often' 'on' 'part'\n",
      " 'patterns' 'phenomena' 'posts' 'prediction' 'predicts' 'processing'\n",
      " 'provide' 'rain' 'real' 'relies' 'require' 'responses' 'revolutionized'\n",
      " 'satellites' 'science' 'seasons' 'semantic' 'sentiment' 'siri' 'skies'\n",
      " 'social' 'speech' 'study' 'such' 'sunny' 'sunset' 'tagging' 'task'\n",
      " 'tasks' 'techniques' 'text' 'the' 'thunderstorms' 'time' 'to' 'today'\n",
      " 'tomorrow' 'tornadoes' 'transitional' 'translation' 'understand'\n",
      " 'understanding' 'unpredictable' 'used' 'virtual' 'watching' 'weather'\n",
      " 'widely' 'with' 'worldwide']\n",
      "(2, 128) [[ 0  1  1  1  0  0  1  0  1  1  4  0  1  1  0  0  1  0  1  1  1  0  1  1\n",
      "   1  0  0  1  0  1  1  2  1  2  0  1  1  1  1  1  0  2  1  1  0  0  1  0\n",
      "   0  1  0  0  0  1  1  0  1  1  4  0  0  1  1  0  0  1  0  1  1  1  0  1\n",
      "   0  1  0  1  1  0  2  1  0  1  1  0  1  1  1  1  1  0  0  1  0  1  0  0\n",
      "   0  1  0  0  0  1  1  1  0  0  0  0  0  4  1  1  1  1  1  1  1  0  0  1\n",
      "   1  0  0  1 11  0  1  1]\n",
      " [ 1  0  0  0  1  1  1  1  3  0  3  1  0  0  1  1  0  1  0  2  0  1  0  0\n",
      "   0  1  1  0  1  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  1  1  0  1\n",
      "   1  0  1  1  2  0  0  1  7  0  4  4  1  3  0  1  1  0  1  0  0  1  2  0\n",
      "   7  0  4  0  0  1  0  0  1  0  0  2  0  0  0  0  0  1  1  0  1  0  1  1\n",
      "   1  0  1  1  1  0  0  0  1  1  1  1  1  0  0  0  2  0  0  0  0  1  1  2\n",
      "   0  2  1  0  0  1  0  0]]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "topic1 = sentences[0] + sentences[1] + sentences[2] + sentences[5] + sentences[6] + sentences[8] + sentences[10] + sentences[12] + sentences[14] + sentences[16] + sentences[18]\n",
    "\n",
    "topic2 = sentences[3] + sentences[4] + sentences[7] + sentences[9] + sentences[11] + sentences[13] + sentences[15] + sentences[17] + sentences[19]\n",
    "\n",
    "vectorizer = CountVectorizer()\n",
    "X = vectorizer.fit_transform([topic1, topic2])\n",
    "\n",
    "# 查看字典\n",
    "vocab = vectorizer.get_feature_names_out()\n",
    "print(len(vocab), vocab)\n",
    "\n",
    "# 查看20个句子中，每个单词出现的次数\n",
    "#  注意, 其顺序和词汇表保持一致, 比如第一行表示, 在第一个句子中, across出现0次, advanced出现0次...\n",
    "X_arr = X.toarray()\n",
    "print(X_arr.shape, X_arr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5 用c-TF-IDF生成主题表示\n",
    "https://github.com/MaartenGr/BERTopic/blob/424cefc68ede08ff9f1c7e56ee6103c16c1429c6/tests/test_vectorizers/test_ctfidf.py#L37"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(2, 128) [[0.         0.0463128  0.0463128  0.0463128  0.         0.\n",
      "  0.03941387 0.         0.03261441 0.0463128  0.10902953 0.\n",
      "  0.0463128  0.0463128  0.         0.         0.0463128  0.\n",
      "  0.0463128  0.03541978 0.0463128  0.         0.0463128  0.0463128\n",
      "  0.0463128  0.         0.         0.0463128  0.         0.0463128\n",
      "  0.0463128  0.07882773 0.0463128  0.07882773 0.         0.0463128\n",
      "  0.0463128  0.0463128  0.0463128  0.0463128  0.         0.07882773\n",
      "  0.0463128  0.0463128  0.         0.         0.0463128  0.\n",
      "  0.         0.0463128  0.         0.         0.         0.0463128\n",
      "  0.0463128  0.         0.02600524 0.0463128  0.10402096 0.\n",
      "  0.         0.03261441 0.0463128  0.         0.         0.0463128\n",
      "  0.         0.0463128  0.0463128  0.03941387 0.         0.0463128\n",
      "  0.         0.0463128  0.         0.0463128  0.0463128  0.\n",
      "  0.07882773 0.0463128  0.         0.0463128  0.0463128  0.\n",
      "  0.0463128  0.0463128  0.0463128  0.0463128  0.0463128  0.\n",
      "  0.         0.0463128  0.         0.0463128  0.         0.\n",
      "  0.         0.0463128  0.         0.         0.         0.0463128\n",
      "  0.0463128  0.0463128  0.         0.         0.         0.\n",
      "  0.         0.13045762 0.0463128  0.0463128  0.03541978 0.0463128\n",
      "  0.0463128  0.0463128  0.0463128  0.         0.         0.03541978\n",
      "  0.0463128  0.         0.         0.0463128  0.25380399 0.\n",
      "  0.0463128  0.0463128 ]\n",
      " [0.04776008 0.         0.         0.         0.04776008 0.04776008\n",
      "  0.04064555 0.04776008 0.10090082 0.         0.08432752 0.04776008\n",
      "  0.         0.         0.04776008 0.04776008 0.         0.04776008\n",
      "  0.         0.07305329 0.         0.04776008 0.         0.\n",
      "  0.         0.04776008 0.04776008 0.         0.04776008 0.\n",
      "  0.         0.         0.         0.         0.04776008 0.\n",
      "  0.         0.         0.         0.         0.04776008 0.\n",
      "  0.         0.         0.04776008 0.04776008 0.         0.04776008\n",
      "  0.04776008 0.         0.04776008 0.04776008 0.0812911  0.\n",
      "  0.         0.04776008 0.18772533 0.         0.10727162 0.13453442\n",
      "  0.04776008 0.10090082 0.         0.04776008 0.04776008 0.\n",
      "  0.04776008 0.         0.         0.04064555 0.0812911  0.\n",
      "  0.19676422 0.         0.13453442 0.         0.         0.04776008\n",
      "  0.         0.         0.04776008 0.         0.         0.0812911\n",
      "  0.         0.         0.         0.         0.         0.04776008\n",
      "  0.04776008 0.         0.04776008 0.         0.04776008 0.04776008\n",
      "  0.04776008 0.         0.04776008 0.04776008 0.04776008 0.\n",
      "  0.         0.         0.04776008 0.04776008 0.04776008 0.04776008\n",
      "  0.04776008 0.         0.         0.         0.07305329 0.\n",
      "  0.         0.         0.         0.04776008 0.04776008 0.07305329\n",
      "  0.         0.0812911  0.04776008 0.         0.         0.04776008\n",
      "  0.         0.        ]]\n"
     ]
    }
   ],
   "source": [
    "from bertopic.vectorizers import ClassTfidfTransformer\n",
    "class_vectorizer = ClassTfidfTransformer()\n",
    "c_tf_idf = class_vectorizer.fit_transform(X).toarray()\n",
    "print(c_tf_idf.shape, c_tf_idf)"
   ]
  },
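  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see where these scores come from, here is a minimal pure-Python sketch of the c-TF-IDF weighting, `tf * log(1 + A / f)`: `tf` is a word's share of all words in one class, `f` is the word's count across all classes, and `A` is the average number of words per class. The function name `c_tf_idf` is hypothetical, and truncating `A` to an integer is an assumption, chosen because it reproduces the values printed above. The example uses the count of weather in topic 1 (11 of 99 words) and lumps every other word into a single placeholder column, so only the first score is meaningful; it should come out at about 0.2538, like the largest value in the first row above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "def c_tf_idf(counts):\n",
    "    # counts[c][x]: how often word x occurs in the merged document of class c\n",
    "    n_words = len(counts[0])\n",
    "    class_sizes = [sum(row) for row in counts]\n",
    "    # A: average number of words per class (assumption: truncated to an integer)\n",
    "    avg_words = int(sum(class_sizes) / len(counts))\n",
    "    # f_x: count of word x summed over all classes\n",
    "    word_freq = [sum(row[x] for row in counts) for x in range(n_words)]\n",
    "    scores = []\n",
    "    for row, size in zip(counts, class_sizes):\n",
    "        scores.append([\n",
    "            (row[x] / size) * math.log(1 + avg_words / word_freq[x]) if word_freq[x] else 0.0\n",
    "            for x in range(n_words)\n",
    "        ])\n",
    "    return scores\n",
    "\n",
    "# Column 0 stands for 'weather'; column 1 is a placeholder for all other words\n",
    "example_counts = [[11, 88], [0, 96]]\n",
    "print(round(c_tf_idf(example_counts)[0][0], 4))"
   ]
  },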
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "weather the and\n",
      "nlp in language\n"
     ]
    }
   ],
   "source": [
    "# 提取重要的词，提取前三个\n",
    "import numpy as np\n",
    "topic1_important_words_index = np.argsort(c_tf_idf[0])[::-1] # 获取tf-idf值最高的下标\n",
    "print(vocab[topic1_important_words_index[0]], vocab[topic1_important_words_index[1]], vocab[topic1_important_words_index[2]])\n",
    "\n",
    "topic2_important_words_index = np.argsort(c_tf_idf[1])[::-1]\n",
    "print(vocab[topic2_important_words_index[0]], vocab[topic2_important_words_index[1]], vocab[topic2_important_words_index[2]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# At this point we have effectively assembled BERTopic by hand\n",
    "# Here is the same pipeline implemented with BERTopic itself"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Topic</th>\n",
       "      <th>Count</th>\n",
       "      <th>Name</th>\n",
       "      <th>Representation</th>\n",
       "      <th>Representative_Docs</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>11</td>\n",
       "      <td>0_weather_the_and_is</td>\n",
       "      <td>[weather, the, and, is, during, for, patterns,...</td>\n",
       "      <td>[The weather today is sunny with clear skies.,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>9</td>\n",
       "      <td>1_nlp_in_of_language</td>\n",
       "      <td>[nlp, in, of, language, is, analysis, and, nat...</td>\n",
       "      <td>[Natural language processing is a field of stu...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Topic  Count                  Name  \\\n",
       "0      0     11  0_weather_the_and_is   \n",
       "1      1      9  1_nlp_in_of_language   \n",
       "\n",
       "                                      Representation  \\\n",
       "0  [weather, the, and, is, during, for, patterns,...   \n",
       "1  [nlp, in, of, language, is, analysis, and, nat...   \n",
       "\n",
       "                                 Representative_Docs  \n",
       "0  [The weather today is sunny with clear skies.,...  \n",
       "1  [Natural language processing is a field of stu...  "
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from bertopic import BERTopic\n",
    "# 20 sentences: some about the weather, some about natural language processing\n",
    "sentences = [\n",
    "    \"The weather today is sunny with clear skies.\",\n",
    "    \"Tomorrow's forecast predicts heavy rain and thunderstorms.\",\n",
    "    \"I love watching the sunset during mild weather evenings.\",\n",
    "    \"Natural language processing is a field of study in computer science.\",\n",
    "    \"NLP techniques are widely used in text analysis and language understanding.\",\n",
    "    \"Weather forecasting relies on complex algorithms and data analysis.\",\n",
    "    \"Understanding weather patterns is crucial for agriculture and disaster management.\",\n",
    "    \"NLP algorithms can help in sentiment analysis of social media posts.\",\n",
    "    \"The weather can be unpredictable, especially during transitional seasons.\",\n",
    "    \"NLP models like BERT and GPT-3 have revolutionized language understanding tasks.\",\n",
    "    \"Extreme weather events like hurricanes and tornadoes require advanced prediction models.\",\n",
    "    \"NLP is used in virtual assistants like Siri and Alexa to understand human commands.\",\n",
    "    \"Climate change is affecting global weather patterns.\",\n",
    "    \"NLP can aid in machine translation, making communication across languages easier.\",\n",
    "    \"Weather satellites provide real-time data for meteorologists to analyze.\",\n",
    "    \"Semantic analysis is an important aspect of natural language processing.\",\n",
    "    \"Weather phenomena such as El Niño impact weather worldwide.\",\n",
    "    \"Part of speech tagging is a fundamental task in NLP.\",\n",
    "    \"The weather in coastal areas is often influenced by ocean currents.\",\n",
    "    \"NLP helps in chatbots to generate human-like responses.\"\n",
    "]\n",
    "\n",
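    "# The key line is the one below. Creating a BERTopic model builds, behind the scenes, an embedding model, a UMAP dimensionality-reduction model, an HDBSCAN clustering model, a CountVectorizer tokenizer, and a c-TF-IDF topic-representation model, then chains them together in that order\n",
    "# min_topic_size=2 is handed to the HDBSCAN clustering model (as min_cluster_size): every cluster must contain at least two documents\n",
    "# Several other BERTopic() arguments are likewise forwarded to the underlying component models, e.g. n_gram_range goes to CountVectorizer\n",
    "topic_model = BERTopic(min_topic_size=2)\n",
    "\n",
    "topic_model.fit_transform(sentences)  # fit the model\n",
    "topic_model.get_topic_info()  # get the per-topic cluster information\n",
    "# The resulting topics are broadly similar to the ones we derived by hand"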
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# BERTopic can also be called in another way\n",
    "Adapted from the official example: https://maartengr.github.io/BERTopic/algorithm/algorithm.html#code-overview"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Topic</th>\n",
       "      <th>Count</th>\n",
       "      <th>Name</th>\n",
       "      <th>Representation</th>\n",
       "      <th>Representative_Docs</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>11</td>\n",
       "      <td>0_weather_the_and_is</td>\n",
       "      <td>[weather, the, and, is, during, for, patterns,...</td>\n",
       "      <td>[The weather today is sunny with clear skies.,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>9</td>\n",
       "      <td>1_nlp_in_of_language</td>\n",
       "      <td>[nlp, in, of, language, is, like, analysis, an...</td>\n",
       "      <td>[Natural language processing is a field of stu...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Topic  Count                  Name  \\\n",
       "0      0     11  0_weather_the_and_is   \n",
       "1      1      9  1_nlp_in_of_language   \n",
       "\n",
       "                                      Representation  \\\n",
       "0  [weather, the, and, is, during, for, patterns,...   \n",
       "1  [nlp, in, of, language, is, like, analysis, an...   \n",
       "\n",
       "                                 Representative_Docs  \n",
       "0  [The weather today is sunny with clear skies.,...  \n",
       "1  [Natural language processing is a field of stu...  "
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sentence_transformers import SentenceTransformer\n",
    "from umap import UMAP\n",
    "from hdbscan import HDBSCAN\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from bertopic.vectorizers import ClassTfidfTransformer\n",
    "from bertopic import BERTopic\n",
    "\n",
    "sentences = [\n",
    "    \"The weather today is sunny with clear skies.\",\n",
    "    \"Tomorrow's forecast predicts heavy rain and thunderstorms.\",\n",
    "    \"I love watching the sunset during mild weather evenings.\",\n",
    "    \"Natural language processing is a field of study in computer science.\",\n",
    "    \"NLP techniques are widely used in text analysis and language understanding.\",\n",
    "    \"Weather forecasting relies on complex algorithms and data analysis.\",\n",
    "    \"Understanding weather patterns is crucial for agriculture and disaster management.\",\n",
    "    \"NLP algorithms can help in sentiment analysis of social media posts.\",\n",
    "    \"The weather can be unpredictable, especially during transitional seasons.\",\n",
    "    \"NLP models like BERT and GPT-3 have revolutionized language understanding tasks.\",\n",
    "    \"Extreme weather events like hurricanes and tornadoes require advanced prediction models.\",\n",
    "    \"NLP is used in virtual assistants like Siri and Alexa to understand human commands.\",\n",
    "    \"Climate change is affecting global weather patterns.\",\n",
    "    \"NLP can aid in machine translation, making communication across languages easier.\",\n",
    "    \"Weather satellites provide real-time data for meteorologists to analyze.\",\n",
    "    \"Semantic analysis is an important aspect of natural language processing.\",\n",
    "    \"Weather phenomena such as El Niño impact weather worldwide.\",\n",
    "    \"Part of speech tagging is a fundamental task in NLP.\",\n",
    "    \"The weather in coastal areas is often influenced by ocean currents.\",\n",
    "    \"NLP helps in chatbots to generate human-like responses.\"\n",
    "]\n",
    "\n",
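    "# The code below makes it even clearer that BERTopic is a stack of individual models ⭐\n",
    "# We can create each model ourselves, which makes its parameters easier to tune\n",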
    "\n",
    "# Step 1 - Extract embeddings\n",
    "embedding_model = SentenceTransformer(\"all-MiniLM-L6-v2\")\n",
    "\n",
    "# Step 2 - Reduce dimensionality\n",
    "umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')\n",
    "\n",
    "# Step 3 - Cluster reduced embeddings\n",
    "hdbscan_model = HDBSCAN(min_cluster_size=2)\n",
    "\n",
    "# Step 4 - Tokenize topics\n",
    "vectorizer_model = CountVectorizer()\n",
    "\n",
    "# Step 5 - Create topic representation\n",
    "ctfidf_model = ClassTfidfTransformer()\n",
    "\n",
    "# All steps together\n",
    "topic_model = BERTopic(\n",
    "  embedding_model=embedding_model,          # Step 1 - Extract embeddings\n",
    "  umap_model=umap_model,                    # Step 2 - Reduce dimensionality\n",
    "  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings\n",
    "  vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics\n",
    "  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words\n",
    ")\n",
    "\n",
    "topic_model.fit_transform(sentences)  # fit the model\n",
    "topic_model.get_topic_info()  # get the per-topic cluster information"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
